This is the next installment in my guide to training an LLM on your text message history, as I did over the holidays across my 240,000 messages. In this post, you’ll find out more than you ever wanted to know about your own text messages… The previous post is here.
In this part and the next, we will cleanse and curate a dataset of documents that are well suited to Natural Language Generation. This may not sound like the most gripping part of the journey – but actually I had a ton of fun with this stage. In addition, perfecting the data format turned out to have a far greater impact on the results than endlessly tweaking hyper-parameters.
I decided to carry out all my work formatting my text messages locally on my laptop. Then I encrypted the data before uploading to Hugging Face. This shouldn’t be strictly necessary, as the data is kept private by Hugging Face, but I wanted to take this extra precaution to keep my texts safe, and you might want to too.
Getting your hands on the data
Bring up a Jupyter Notebook or JupyterLab on your local machine (for example, by creating a virtual env and running pip install jupyterlab, then jupyter lab). Then get your environment ready for show time:
# constants
MAX_LENGTH = 200 # The length of each chunk
DATA_NAME = "your-hf-username/messagesv1" # for uploading
ME = "Edward" # your name here!
base_model_name = "meta-llama/Llama-2-7b-chat-hf" # for the tokenizer
# installs
!pip install ipywidgets datasets cryptography torch transformers sentencepiece matplotlib wordcloud
# imports
import csv
import datetime
import random
from collections import Counter
import datasets
from cryptography.fernet import Fernet # to encrypt our texts
import tqdm
import torch
from transformers import AutoTokenizer
import matplotlib
import matplotlib.pyplot as plt
from wordcloud import WordCloud
%matplotlib inline
Put the CSV exports from iMazing into the root directory of your notebook, and run this, updating with your filenames:
txts = []
filenames = ['all_text_messages.csv', 'all_whatsapp_messages.csv']
for filename in filenames:
    with open(filename, newline='') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
        txts.extend(reader)
print(f'Read in a total of {len(txts):,} messages')
# For me, this outputs: Read in a total of 266,337 messages
It’s worth taking some time to dig into the data to spot problems. Here’s one I found and fixed right away:
INTRO = 'Messages to this chat and calls are now secured with end-to-end encryption'
txts = [txt for txt in txts if txt['Text'] != INTRO]
print(f'Now a total of {len(txts):,} messages')
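One way to spot boilerplate like this is to count which exact message texts repeat most often. Here is a minimal sketch, assuming each row is a dict with a 'Text' key as in the export above. The sample rows and the helper name most_repeated_texts are made up for illustration.

```python
from collections import Counter

def most_repeated_texts(rows, top_n=5):
    """Return the top_n most frequent exact message texts with their counts."""
    counts = Counter(row['Text'] for row in rows if row['Text'])
    return counts.most_common(top_n)

# Hypothetical sample rows standing in for the real `txts` list
sample_rows = [
    {'Text': 'Messages to this chat and calls are now secured with end-to-end encryption'},
    {'Text': 'ok'},
    {'Text': 'Messages to this chat and calls are now secured with end-to-end encryption'},
    {'Text': 'see you at 8'},
    {'Text': ''},
]

for text, count in most_repeated_texts(sample_rows, top_n=2):
    print(f'{count:>3}  {text[:60]}')
```

On your real data you would call most_repeated_texts(txts, top_n=20) and scan the list for anything that looks system-generated rather than typed by a person.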
The messages are currently dicts, and at this point, I convert them to objects for convenience. I filter out messages from groups and people not in my contacts. I also cleanse the text a bit: newlines become spaces, and colons and hash characters are swapped for semicolons, because later those characters will have a special meaning in the training data. Messages with no text (such as images) are marked with '***'.
class Message:
    def __init__(self, chat_session, message_type, text, when):
        self.name = chat_session
        self.sender = self.name if message_type == 'Incoming' else ME
        self.receiver = ME if message_type == 'Incoming' else self.name
        self.text = text
        self.when = datetime.datetime.strptime(when, '%Y-%m-%d %H:%M:%S')
        self.massage_text()
    def massage_text(self):
        # Replace special characters used in our format for training
        self.text = self.text.replace('\n',' ').replace(':',';').replace('#',';')
        # Indicate if the message is an image
        if self.text == '': self.text = '***'
    def should_exclude(self):
        return any(ch in self.name for ch in '+&,') or all(ch.isdigit() for ch in self.name)
# Create lists of messages
messages = [Message(t['Chat Session'], t['Type'], t['Text'], t['Message Date']) for t in txts]
messages = [m for m in messages if not m.should_exclude()]
print(f'A total of {len(messages):,} messages')
# For me, this outputs: A total of 242,230 messages
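To see what the cleansing and exclusion rules do in practice, here is a small standalone sketch that mirrors the two methods above as plain functions. The chat names and sample text are hypothetical, and treating '+', '&' and ',' as signs of a group chat or a number outside your contacts is my reading of the rule, not something the export guarantees.

```python
def should_exclude(name):
    """Mirror of the rule above: skip group-style names and bare numbers."""
    has_group_marker = any(ch in name for ch in '+&,')
    only_digits = all(ch.isdigit() for ch in name)  # note: also True for an empty name
    return has_group_marker or only_digits

def massage_text(text):
    """Mirror of the cleansing above: flatten newlines, swap ':' and '#' for ';'."""
    text = text.replace('\n', ' ').replace(':', ';').replace('#', ';')
    return text if text != '' else '***'

# Hypothetical chat names
for name in ['Alice', 'Alice & Bob', 'Alice, Bob', '+1 555 0100', '5550100']:
    print(f'{name!r:>16} -> exclude={should_exclude(name)}')

print(massage_text('see you at 8:30\n#late'))
print(massage_text(''))
```

Running variations of this on a few of your own chat names is a quick way to check that the filter is not throwing away contacts you care about.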
Organize the data
We now organize the messages into a dictionary, keyed by the name of the person we’re chatting with. The values are lists of Message objects, sorted by earliest message first. This should interweave text and WhatsApp messages.
# Organize into dict with key = chat name, value = list of messages
chats = {}
for message in messages:
    if message.name not in chats:
        chats[message.name] = []
    chats[message.name].append(message)
# Sort the chats by time
for message_list in chats.values():
    message_list.sort(key = lambda m: m.when)
print(f'{sum([len(v) for v in chats.values()]):,} messages with {len(chats)} people')
# Gives me 242,230 messages with 472 people
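The interweaving is easiest to see on a toy example. This sketch uses made-up (name, source, timestamp) tuples rather than Message objects, and collections.defaultdict as a shorthand for the "create the list if missing" step. It is an illustration of the idea, not a replacement for the code above.

```python
import datetime
from collections import defaultdict

# Hypothetical rows: two sources for Alice, arriving out of order
rows = [
    ('Alice', 'sms', '2023-01-01 10:00:00'),
    ('Alice', 'whatsapp', '2023-01-01 09:30:00'),
    ('Bob', 'sms', '2023-01-02 12:00:00'),
    ('Alice', 'sms', '2023-01-01 09:45:00'),
]

# Group by chat name; defaultdict creates the empty list on first access
grouped = defaultdict(list)
for name, source, when in rows:
    parsed = datetime.datetime.strptime(when, '%Y-%m-%d %H:%M:%S')
    grouped[name].append((parsed, source))

# Sort each chat by timestamp so both sources end up in one timeline
for message_list in grouped.values():
    message_list.sort(key=lambda item: item[0])

print([source for _, source in grouped['Alice']])
```

After sorting, Alice's WhatsApp message sits ahead of her two text messages because it has the earliest timestamp, which is the merged timeline we want for training.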
I decided that I needed to have exchanged at least 20 text messages with someone for them to be included. This had a marginal effect on the message count, but a large effect on the number of unique people. Definitely something to experiment with.
AT_LEAST = 20
chats = {name: messages for name, messages in chats.items() if len(messages)>=AT_LEAST}
print(f'{sum([len(v) for v in chats.values()]):,} messages with {len(chats)} people')
# Gives me 240,985 messages with 290 people
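If you want to experiment with the threshold, it helps to see several cut-offs side by side before committing to one. A minimal sketch, assuming a dict that maps each person to their message count. The counts below are invented, and summarize_threshold is a hypothetical helper name.

```python
def summarize_threshold(chat_lengths, at_least):
    """Return (messages kept, people kept) for a minimum chat length."""
    kept = [n for n in chat_lengths.values() if n >= at_least]
    return sum(kept), len(kept)

# Hypothetical message counts per person
chat_lengths = {'Alice': 5000, 'Bob': 300, 'Carol': 25, 'Dave': 8, 'Eve': 2}

for at_least in (1, 10, 20, 100):
    total, people = summarize_threshold(chat_lengths, at_least)
    print(f'AT_LEAST={at_least:>3}: {total:,} messages with {people} people')
```

With the real data, build chat_lengths from the unfiltered dictionary (for example, {name: len(msgs) for name, msgs in chats.items()} before applying AT_LEAST) and pick the cut-off where the number of people drops sharply but the message count barely moves.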
Investigate the data
It’s always essential to examine your data to look for trends and anomalies. In this case, it’s also super interesting! I started by looking at how often I’ve texted through the years (guess when I met my partner...).
# Prepare data
dates = [message.when for message in messages]
# Plot
fig, ax = plt.subplots(1, 1)
plt.title("How many texts I've sent over time")
ax.set_xlabel('Year')
ax.set_ylabel('How many texts');
ax.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda y, p: format(int(y), ',')))
_ = ax.hist(dates, bins=20, color='purple', rwidth=0.5)
And if you’re curious to see who you’ve texted the most often:
# Prepare data
counter = Counter(message.name for message in messages)
results = counter.most_common(40)
names, counts = zip(*results)
# Plot
fig, ax = plt.subplots(1, 1, figsize = (10, 5))
ax.set_ylabel('How many texts');
ax.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda y, p: format(int(y), ',')))
plt.xticks(range(len(names)), names, rotation='vertical')
_ = ax.bar(names, counts, color ='teal', width = 0.5)
These are my results, with the names masked!
I’ve saved the best for last! Now we create a word cloud of messages to get some insights. For my version here, I’ve limited it to the top 50 words to avoid revealing some of my unfortunate nicknames... but it’s great fun to investigate beyond 50 words! Also try filtering the messages on just you, or on chats with specific people, to see how the tone of your conversations changes.
# Prepare data
text = ' '.join([message.text for message in messages])
# Alternatively - only my sent messages, or only chats with 1 person
# text = ' '.join([message.text for message in messages if message.sender==ME])
# text = ' '.join([message.text for message in chats['recipient name here']])
# Plot
wordcloud = WordCloud(max_font_size=60, max_words=50).generate(text)
plt.figure(figsize = (10, 10))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()
Prep for next time: tokens
In the next post, we’ll start to work with tokens instead of text. If you’re not familiar with tokenization, now would be an excellent time to check out the fantastic video from Jon Krohn that I posted last week. In preparation for next time, we should investigate the Tokenizer used by the Llama 2 model.
tokenizer = AutoTokenizer.from_pretrained(base_model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
We can use the methods encode, decode and convert_ids_to_tokens to investigate how text maps to tokens, and some of the quirks of tokenization.
tokens = tokenizer.encode('### Edward: hello me ### Edward: hi me')
print(tokens)
print(tokenizer.decode(tokens))
print(tokenizer.convert_ids_to_tokens(tokens))
From experimenting with the inputs, you’ll notice a few things:
- The tokenizer adds a beginning of sentence token <s> with value 1
- If you change the input to put each message on a new line (as in, '### Edward: hello me\n### Edward: hi me') then the second ### is represented differently from the first (broken into 2 tokens). I tried to structure the data to avoid this kind of inconsistency.
- You’ll notice that the prompt tags <<SYS>> <</SYS>> and [INST] [/INST] don’t have special tokens; they get tokenized like any other text. I was surprised by this, as were others on the internet… apparently the use of these tags is just a convention that’s been used during training.
Next time, we’ll pack our text messages into datasets, encrypt them and upload to Hugging Face, ready for us to fine-tune our model using QLoRA. The next installment is here.